emaldonado1127@floridapoly.com
library(tidyverse)
Warning: package ‘tidyverse’ was built under R version 4.1.3
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages ----------------------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.6 v purrr 0.3.4
v tibble 3.1.5 v dplyr 1.0.9
v tidyr 1.1.4 v stringr 1.4.0
v readr 2.0.2 v forcats 0.5.1
Warning: package ‘ggplot2’ was built under R version 4.1.3
Warning: package ‘dplyr’ was built under R version 4.1.3
-- Conflicts -------------------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
cali_birth <- read_csv("C:\\Users\\maldo\\OneDrive\\Desktop\\FloridaPoly\\Data vizualization\\dataviz_mini-project_02\\dataviz_mini-project_02\\data\\california_birth.csv", col_types = cols())
str(cali_birth)
spec_tbl_df [1,764 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ year : num [1:1764] 2008 2008 2008 2008 2008 ...
$ patcnty: chr [1:1764] "Alameda" "Alameda" "Alameda" "Alameda" ...
$ agegrp : chr [1:1764] "Total Births" "Older Mothers (35 years old or older)" "Teen Mothers (15 years old to 19 years old)" "Typical Aged Mothers (20 years old to 34 years old)" ...
$ count : num [1:1764] 20470 4714 1319 14422 2464 ...
- attr(*, "spec")=
.. cols(
.. year = col_double(),
.. patcnty = col_character(),
.. agegrp = col_character(),
.. count = col_double()
.. )
- attr(*, "problems")=<externalptr>
For each county, the data set is split into “agegrp” groups, one of which is “Total Births”. We are going to drop these rows for two reasons. First, the totals do not add up: in the first rows shown above (Alameda, 2008), the three age groups sum to 4,714 + 1,319 + 14,422 = 20,455, not the reported 20,470. Second, the totals are not needed for this part of the analysis.
births_year <- cali_birth %>% group_by(year, agegrp) %>%
filter(agegrp != "Total Births")
births_year
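The mismatch can be checked directly. The sketch below rebuilds the four Alameda 2008 rows shown in the str() output as a small data frame and compares the reported total with the sum of the three age groups. The names `alameda_2008` and `check` are hypothetical and used only for this illustration.

```r
library(dplyr)

# The four Alameda rows for 2008, copied from the str() output above.
alameda_2008 <- data.frame(
  year    = 2008,
  patcnty = "Alameda",
  agegrp  = c("Total Births",
              "Older Mothers (35 years old or older)",
              "Teen Mothers (15 years old to 19 years old)",
              "Typical Aged Mothers (20 years old to 34 years old)"),
  count   = c(20470, 4714, 1319, 14422)
)

# Compare the reported total with the sum of the age groups.
check <- alameda_2008 %>%
  group_by(year, patcnty) %>%
  summarise(
    reported = sum(count[agegrp == "Total Births"]),
    summed   = sum(count[agegrp != "Total Births"]),
    .groups  = "drop"
  ) %>%
  mutate(difference = reported - summed)

check
```

For these rows the difference is 15 births. Running the same summary on the full `cali_birth` data would show whether other counties and years have a similar gap.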
Now that those rows are removed, let’s compute each age group’s share of births within each year.
births_yearngroup <- births_year %>%
group_by(year, agegrp) %>%
summarize(total = sum(count))%>%
mutate(freq = total / sum(total),
pct = round((freq*100), 2))
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
births_yearngroup
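The percentages come out per year because `summarise()` drops only the last grouping variable, so the result is still grouped by `year` when `mutate()` runs. A minimal sketch of that behaviour follows, using made-up counts (the `toy` data frame and its values are assumptions, not taken from the data set). Setting `.groups = "drop_last"` explicitly states the default and silences the message shown above.

```r
library(dplyr)

# Made-up counts for illustration only; not taken from the data set.
toy <- data.frame(
  year   = c(2008, 2008, 2008, 2009, 2009, 2009),
  agegrp = rep(c("Older", "Teen", "Typical"), times = 2),
  count  = c(20, 10, 70, 30, 10, 60)
)

shares <- toy %>%
  group_by(year, agegrp) %>%
  # "drop_last" keeps the result grouped by year and silences the message.
  summarise(total = sum(count), .groups = "drop_last") %>%
  # Still grouped by year here, so sum(total) is a per-year sum.
  mutate(freq = total / sum(total),
         pct  = round(freq * 100, 2))

shares
```

In this toy example the `pct` values within each year add up to 100, which is a quick sanity check that could also be applied to `births_yearngroup`.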
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
P2.1 <- ggplot(births_yearngroup, aes(x = year, y = pct, fill = agegrp)) +
geom_col(position = "dodge2") +
labs(title = "Births in California by mother's age group", x = "Year", y = "Percent", fill = "Groups") +
coord_cartesian(xlim =c(2008, 2016)) +
scale_fill_brewer(type = "qual", palette = "Dark2")
my_plot <- ggplotly(P2.1)
my_plot
Comment
Note that the share of typical-aged mothers has barely changed, staying between 72.94% and 73.86%.
This is a useful baseline, since they are the majority in this data set.
library(htmlwidgets)
saveWidget(my_plot, "my_plot.html")
library(sf)
Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE
# Load
cali_counties <- read_sf("C:\\Users\\maldo\\OneDrive\\Desktop\\FloridaPoly\\Data vizualization\\dataviz_mini-project_02\\dataviz_mini-project_02\\data\\ca-county-boundaries\\CA_Counties\\CA_Counties_TIGER2016.shp")
cali_counties
Simple feature collection with 58 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -13857270 ymin: 3832931 xmax: -12705030 ymax: 5162404
Projected CRS: WGS 84 / Pseudo-Mercator
To see which counties contribute the most births, we’ll build a new data frame that keeps only the “Total Births” rows.
births_yearTotal <- cali_birth %>%
group_by(year, agegrp) %>%
filter(agegrp == "Total Births")
births_yearTotal
colnames(births_yearTotal)[2] <- "NAME"
To plot these two data sets together, let’s join them by county name.
birth_map <- cali_counties %>%
left_join(births_yearTotal, by = "NAME")
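Before mapping, it can help to check how well the join matched. The sketch below uses two small hypothetical data frames: `counties` stands in for `cali_counties` without its geometry, and `births` stands in for `births_yearTotal`. All values are made up. It lists counties with no birth rows and counts how many rows each county has after the join.

```r
library(dplyr)

# Hypothetical stand-ins with made-up values.
counties <- data.frame(NAME = c("Alameda", "Alpine", "Los Angeles"))
births <- data.frame(
  year  = c(2008, 2009, 2008, 2009),
  NAME  = c("Alameda", "Alameda", "Los Angeles", "Los Angeles"),
  count = c(100, 110, 900, 950)
)

# Counties with no matching birth rows; they would get NA counts in the join.
unmatched <- anti_join(counties, births, by = "NAME")

# Rows per county after the join: more than one means several years overlap.
rows_per_county <- counties %>%
  left_join(births, by = "NAME") %>%
  count(NAME, name = "n_rows")

unmatched
rows_per_county
```

If a county has several rows after the join (one per year), a map of the joined data draws those polygons on top of each other. Filtering to a single year, or summarising across years before joining, is one way to make the fill colour unambiguous.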
P2.2 <- ggplot(birth_map) +
geom_sf(aes(fill = count),
alpha=0.9, col="white") +
scale_fill_viridis_c(name = "Births", trans = "log2", option = "plasma") +
labs(title = "Birth in California")
P2.2
library("svglite")
Warning: package ‘svglite’ was built under R version 4.1.3
ggsave("Birth in California.jpg", P2.2)
Saving 7 x 7 in image
Comment
Los Angeles County, shown in yellow, has the highest number of births. Several counties shown in orange probably contain other large cities, such as San Francisco.
library(broom)
# births_yearTotal is still grouped by year, so this returns the top 5 counties per year
births_yearTotal %>%
slice_max(count, n = 5)
births_yearLA <- cali_birth %>% group_by(year, agegrp) %>%
filter(agegrp == "Older Mothers (35 years old or older)" & patcnty == "Los Angeles")
births_yearLA
birth_model <- lm(count ~ year, data = births_yearLA)
# Plot the same data the model was fitted on (births_yearLA)
P2.3 <- ggplot(births_yearLA, aes(x = year, y = count)) +
geom_point() +
geom_smooth(method = "lm",
formula = y ~ x) +
theme_minimal()
P2.3
Comment
As we’ve seen, Los Angeles has one of the highest birth counts in California, so a fair comparison would most likely need a county of similar size. For the time being, we fit a linear model of births to older mothers in Los Angeles against year.
birth_model2<- tidy(birth_model, conf.int = TRUE)%>%
filter(term != "(Intercept)")
glance(birth_model)
P2.4 <- ggplot(birth_model2,
aes(x = estimate,
y = fct_rev(term))) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 0, color = "purple") +
theme_minimal()
P2.4
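The coefficient plot shows the `year` slope with its confidence interval: if the interval does not cross the purple line at zero, the trend is distinguishable from no change at that confidence level. As a rough sketch of what those numbers mean, the base R example below fits the same kind of model to made-up yearly counts (the `toy` values are assumptions, not taken from the data set) and extracts the slope and its 95% interval, which should correspond to the `estimate`, `conf.low`, and `conf.high` columns that `tidy(conf.int = TRUE)` reports.

```r
# Made-up yearly counts for illustration only; not taken from the data set.
toy <- data.frame(
  year  = 2008:2012,
  count = c(100, 112, 118, 131, 139)
)

toy_model <- lm(count ~ year, data = toy)

# The "year" coefficient is the estimated change in count per additional year.
slope <- unname(coef(toy_model)["year"])

# 95% confidence interval for the slope.
slope_ci <- confint(toy_model, "year", level = 0.95)

slope
slope_ci
```

Here the slope is 9.7, meaning the toy counts rise by about 9.7 per year on average. The slope of `birth_model` reads the same way for births to older mothers in Los Angeles.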